Chinese Word Searching In Imaged Documents

نویسندگان

  • Yue Lu
  • Chew Lim Tan
چکیده

An approach to searching for user-specified words in imaged Chinese documents, without the requirements of layout analysis and OCR processing of the entire documents, is proposed in this paper. A small number of Chinese characters that cannot be successfully bounded using connected component analysis due to larger gaps between elements within the characters are blacklisted. A suitable character that is not included in the blacklist is chosen from the user-specified word as the initial character to search for a matching candidate in the document. Once a matched candidate is found, the adjacent characters in the horizontal and vertical directions are examined for matching with other corresponding characters in the user-specified word, subject to the constraints of alignment (either horizontal or vertical direction) and size similarity. A weighted Hausdorff distance is proposed for the character matching. Experimental results show that the present method can effectively search the user-specified Chinese words from the document images with the format of either horizontal or vertical text lines, or both appearing on the same image.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval

Searching documents for information and retrieval of relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but not much robust methods are available for retrieval from historic documents and old manuscripts as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...

متن کامل

Topic Modeling of Chinese Language Using Character-Word Relations

Topic models are hierarchical Bayesian models for language modeling and document analysis. It has been well-used and achieved a lot of success in modeling English documents. However, unlike English and the majority of alphabetic languages, the basic structural unit of Chinese language is character instead of word, and Chinese words are written without spaces between them. Most previous research...

متن کامل

Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents

In keyword spotting from handwritten documents by text query, the word similarity is usually computed by combining character similarities, which are desired to approximate the logarithm of the character probabilities. In this paper, we propose to directly estimate the posterior probability (also called confidence) of candidate characters based on the N-best paths from the candidate segmentation...

متن کامل

A word language model based contextual language processing on Chinese character recognition [7534-22]

The language model design and implementation issue is researched in this paper. Different from previous research, we want to emphasize the importance of n-gram models based on words in the study of language model. We build up a word based language model using the toolkit of SRILM and implement it for contextual language processing on Chinese documents. A modified Absolute Discount smoothing alg...

متن کامل

A word language model based contextual language processing on Chinese character recognition

The language model design and implementation issue is researched in this paper. Different from previous research, we want to emphasize the importance of n-gram models based on words in the study of language model. We build up a word based language model using the toolkit of SRILM and implement it for contextual language processing on Chinese documents. A modified Absolute Discount smoothing alg...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IJPRAI

دوره 18  شماره 

صفحات  -

تاریخ انتشار 2004